An Alternative Model of Amino Acid Replacement 
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Abstract 

The observed correlations between pairs of homologous protein 
sequences are typically explained in terms of a Markovian dy- 
namic of amino acid substitution. This model assumes that ev- 
ery location on the protein sequence has the same background 
distribution of amino acids, an assumption that is incompatible 
with the observed heterogeneity of protein amino acid profiles 
and with the success of profile multiple sequence alignment. We 
propose an alternative model of amino acid replacement during 
protein evolution based upon the assumption that the variation 
of the amino acid background distribution from one residue to 
the next is sufficient to explain the observed sequence corre- 
lations of homologs. The resulting dynamical model of inde- 
pendent replacements drawn from heterogeneous backgrounds is 
simple and consistent, and provides a unified homology match 
score for sequence-sequence, sequence-profile and profile-profile 
alignment. 

Introduction 

During evolution, a protein's amino acid sequence is altered by 
the insertion and deletion of residues and by the replacement of 
one residue by another. In principle, the alignment of proteins 
sequences and the subsequent detection of protein homologs and 
the inference of protein phylogenies requires a dynamical model 
of this sequence evolution. The most common and widely used 
residue replacement dynamics is the standard Dayhoff model, 
which assumes that the substitution probability during some time 
interval depends only on the identities of the initial and replace- 
ment residues and that the dynamics is otherwise homogeneous 
along the protein chain, and between protein families, and across 
evolutionary epochs. In other words, under this model the dynam- 
ics of amino acid substitution resembles a continuous time, first 



order Markov chain (.Davhoff ef a/.L Il972l 1978 ; iGonnet et ail 
1 1 9921: [Tones et aZlll992HMuller & VingronlbQOOl) . 



However, it has long been known that this widely used Marko- 
vian substitution model is fundamentally unsatisfactory. One ma- 
jor problem is that th e short and lon^ time substitution dynam- 
ics are incom patible ( Gonnet ef a/. , 1992: Be nner et a/1 I994t 
iMiiller & vingroa 2000) . Bennerefa/. (1994) suggests that this 
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is because at short evolutionary times the patterns of substitu- 
tion are influenced by single base mutations between neighboring 
codons, whereas for more diverged sequences the genetic code is 
irrelevant and the patterns of replacement are dominated by the 
selection of chemically and structural compatible residues. 

A more serious problem with the Dayhoff Markovian model 
is that it assumes that every residue in every protein has the 
same background distribution of amino acids and that protein se- 
quences rapidly evolve to this uninteresting equilibrium. In ac- 
tuality, the amino acid background distribution varies markedly 
from one residue position to the next, as can be seen, by ex- 
ample, in figure n These large site-to-site variations are stable 
across relatively long evolutionary time-scales, and they account 
for the success of protein hidden Markov models and other profile 
based multiple sequence alignment methods. (See, for example, 
Siolander et al.^ .1996: Durbin ef al.. J998) Profile methods can 
detect substantially mor e remote homologies than pairwise align- 
ment jPark et a/.L Il998'. Green and Brenner, Unpublished data). 
In short, the dynamics of amino acid substitution are not Marko- 
vian, stationary, nor homogeneous, and the prediction of rapidly 
decaying sequence correlations is at odds with the success of pro- 
file based remote homology detection. 

A natural solution to the limitations of the Markov model 
is to assume that residue replacement is governed by differ- 
ent Markov processes for each position, each process poten- 
tially possessing its own background distribution and substi- 
tution probabilities. The appropriate Markov matrix for a 
particular protein position is chosen based upon predictions 



of the protein 


structure, or directly from the sequence data. 


( Goldman et al. 


'l996;'Thorne ef aZ.''l996' 


'ToDham ef a/.Vl997 


Goldman et al... 


.1998: Koshi & Goldstein. 


1998; Dimmicefa/. 


bOOOllLartillot & FhilioDel l2004 However, this aonroach is both 



computationally and conceptually complex. 

Here, we propose that the observed sequence correlation be- 
tween diverged homologs is principally due to the heterogeneous, 
stable, background distribution of each protein site and, therefore, 
that a Markovian amino acid replacement dynamics is overly 
complicated and possible unnecessary for the accurate construc- 
tion of protein sequence alignments and phylogenies. As an al- 
ternative, we construct a dynamical model of amino acid replace- 
ment that explicitly assumes that each protein site has a different 
equilibrium distribution of the 20 canonical amino acids (which 
we refer to as that site's amino acid background, 9) and that the 
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Figure 1 : The amino acid background distribution of a site within 
a protein is often stable across large evolutionary time-scales, but 
varies markedly form one site to another This figure illustrates 
the helix-turn-helix motif from the CAP family of homodimeric 
DNA binding proteins. The height of each letter corresponds 
to the amino acid frequency in a multiple alignm ent of 100 di- 
verse, homologous sequences. (For details, see ICrooks et al\ 
[2004) (Schneider & Stephens, 1990) These background distribu- 
tions are determined by structural, functional and evolutionary 
constraints. For example, positions 180, 181 and 185 are critical 
to the sequence specific binding of the protein to DNA, the con- 
served glycine at position 177 is located on inside of the turn be- 
tween the helices, and the buried sites 172, 176, 178, 183, 187 and 
190 contain mostly hydrophobic residues. It should be noted that 
the correlations inherent in these background distributions are far 
stronger than can be explained by local structural features (such 
as buria l and secondary structure) alone. ( Crooks & Brenner. 
l2004rJ/J). 



residue distribution of each site rapidly (relative to evolutionary 
time scales) relaxes to this local, site-specific equilibrium. These 
distributions themselves conform to a probability distribution of 
backgrounds, P{6), which we may discover by studying many 
families of homologous proteins (See Fig.|2ji. We do not need to 
model the short time dynamics with any great accuracy, since the 
alignment of highly conserved homologs is relatively straightfor- 
ward. Therefore, we will assume that replacement residues are 
randomly sampled from the background distribution of that site. 
Consequentially, (and in direct contrast to the Markov model) 
the replacement residue is conditionally independent of the initial 
amino acid at all times. Note that multiple substitutions are statis- 
tically equivalent to a single substitution (since a single mutation 
is sufficient to relax a site to local equilibrium) and that a residue 
can be replaced by the same amino acid type. The resulting dy- 
namic is a site specific, continuous time, zeroth order Markov 
chain, similar in spirit to Felsenstein's (1981) model of nucleic 
acid substitution. The crucial difference is that the initial and re- 
placement residues are non-trivially correlated because both have 
been sampled from the same background distribution, whereas 
non-homologous residues are drawn from different backgrounds. 
Bruno has used essentially the same dynamical model discussed 
here, albeit without incorporating the background prior distribu- 
tion, to find maximum likelihood estimates of site-specific amino 
acid frequencies ( Bruna .1996). 

Under our residue replacement model, the principle origin of 
sequence correlation between diverged homologs is the back- 
ground distribution of each protein site. This is also the cen- 
tral idea underlying profile based multiple sequence alignment 




Hydrophilic 



Figure 2: The distribution of amino acid backgrounds is het- 
erogeneous and multimodal. This ternary scatter plot rep- 
resents 5000 randomly sampled distributio ns drawn from the 
dist.2 0comp Dirichlet mixture model dKamhisi 11994 of 
amino acid backgrounds. Each has been projected onto the three 
dimensional subspace of hydrophobic (C, F, I, L, M, V, W, Y), 
neutral (A, G, P, S, T) and hydrophilic (D, E, H, K, N, Q, R) 
residues. Major peaks in the probability density are located 
around (0.5, 0.25, 0.25) hydrophobic/neutral/hydrophilic, and to- 
wards the vertices of this ternary plot. Thus, if we observe sev- 
eral homologous hydrophilic residues (e.g. position 179 or 184 in 
fig.n* we can be reasonably confident that additional homologous 
residues will also be hydrophilic. 



algorithms. Therefore, we are not proposing a radically dif- 
ferent method for homolog detection or sequence alignment; 
rather we are proposing a concrete and consistent dynamics for 
the underlying evolutionary process. The implications of this 
dynamics can be readily extended to cover not only profile- 
sequence based alignment, but also profile-profile and pairwise 
sequence-sequence alignment. Moreover, when we consider pair- 
wise, sequence alignment below, we find that our model is es- 
sentially equivalent to the standard pairwise alignment methods, 
as they are used in practice. This alternative dynamical model 
of amino acid replacement is biologically reasonable, conceptu- 
ally straightforward and can adequately explain many of the ob- 
served patterns of homolog sequence correlation without invok- 
ing a Markovian dynamic. 

The correlations between pairs of homologous residues can be 
summarized by an amino acid substitution matrix, S, whose en- 
tries represent the log probability of observing the homologous 
pair of amino acids qij in a properly aligned pair of homologous 
proteins, against the probability ViV ^ of independen tly observing 
the residues in unrelated sequences (lAltschullll99lh . 

5y = |log^ (1) 

Units of one third bits are traditional for substitution matrices 
(base 2 logarithm, A = 1/3, ~ -tj digits), although the scaUng 
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Figure 3: This substitution matrix (Eq.[2 a conventional descrip- 
tion of amino acid replacement propensities) has been directly 
constructed from the dist.20comp Dirich let mixture mod el of 
amino acid background probabilities (Fig. 2: lKamlus[[l995l) . us- 
ing the conditionally independent substitution model of amino 
acid replacement (Eqs. 3-10). As a consequence of the hetero- 
geneity and stability of amino acid background probabilities - 
illustrated in figs. and |2l- the amino acid identity of a pair of 
alignable, homologous residues is non-trivially correlated over 
long evolutionary time scales, simply because both residues are 
drawn from the same background. Scores are in units of ^ bits, 
rounded to the nearest integer. 



is arbitrary. Assuming conditionally independent replacements, 
we can directly construct the large time limit substitution matrix 
from the background probability distribution, P{9). Fortunately, 
this large distribution has previously been investigated and pa- 
rameterized by fitting many columns from multiple alignments 
of homolo gous pr otein seq uences to a m ixture of Dirichlet dis- 
tributions tearplusk .1995.: iDurbin et ali Fig. l2l displays 
a pr ojection of the dist . 2 0comp parameterization llKarplusi 
1 1995 ) and Fig.|3ldisplays the corresponding substitution matrix. 
The mathematical details of matrix construction are given below. 

At shorter evolutionary times there is a significant chance that 
no mutation has occurred at all, resulting in an enhanced prob- 
ability of amino acid conservation. Let c be the probability of 
zero mutation events, then the substitution matrix, adjusted for 
the possibility of zero mutations, is 



5.,(c)=log^ 



cpiSij + (1 - c)g,j 



PiPj 



(2) 



where 6ij is the Kronecker delta function. A reasonable default 
model for the conservation probability c would be to assume that 
substitutions are Poissonian. Then c = exp(— t/r), where r is 
the mean time between replacements. Note that although replace- 
ment is Markovian (albeit zeroth order), the dynamic decay of 
residue correlations at a position is not, due to the heterogene- 
ity of the amino acid background at that position (an unobserved 
hidden variable). 
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Figure 4: Schematic representation of the decay of sequence cor- 
relation with evolutionary time. There is an initial rapid reduction 
in correlation on the time scale of single residue substitutions. 
Under the standard Markov model, this exponential decay would 
continue. However, under the profile model the correlations in- 
stead limit towards a plateau value, due to the heterogeneity of 
the background amino acid distribution. These are the correla- 
tions captured by the substitution matrix of Fig.|3l Finally, over a 
second, much longer time scale the sequence correlations decay 
towards insignificance due to changes in the site-specific back- 
ground distribution. 



Eq. |2] gives the log odds of aligned residues, given the inter- 
sequence divergence. Conversely, given a prior on the parameter 
c and a fixed alignment we can invert Eq.|2land estimate the di- 
vergence between sequences. Conserved residues indicate small 
divergences and unconserved pairs argue for large divergences, 
although different pairs are weighted differently. 

As evolution proceeds, the background distribution of a site 
may itself change, due, for example, to a change in structure of 
that part of the protein. This will result in a loss of homology 
signal under our model, and it may no longer be possible to align 
the diverged residues, nor to recognize them as homologs. This 
is schematically illustrated in Fig.|4] 

Various families of substitution matrices have been developed, 
included PAM, BLOSUM and VTML. Different members of the 
same family represent different degrees of sequence divergence. 
In principle, we should match the divergence inherent in the sub- 
stitution matrix to t he div ergence of the pair of sequences we wish 
to align (lAltschul[l993l) . However, this is computationally ex- 
pensive, and, in practice, a single matrix is chosen based on its 
ability to aUgn remote homologs, on the grounds that m atching 
close homologs is relatively easy terenner et all fl998i) . Under 
the Markov model, the chosen matrix has no particular signifi- 
cance. On the other hand, under our model there is a natural, 
non-trivial, long time limit matrix (Fig. |3j Eq. |2l c = 0). This 
matrix represents the sequence correlations at any time after the 
first few mutations, and before the underlying amino acid back- 
ground itself diverges. (Fig.|4} 

Figures Is] and |6l demonstrate that the dist.2 0comp ma- 
trix represents a similar level of evolutionary divergence, and 
similar patterns of substitution as BLOSUM62 and VTML160, 
two substitution matrices commonly used for pairwise sequence 
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Figure 5: Comparison between the bit scores of BLOSUM62 faeni koff & HenikofA fl992i) . VTML160 (" Miiller et aH l2002h and 
dist . 2 Ocomp (Fig.|3} substitution matrixes. All three matrices reflect similar levels and patterns of sequence divergence, but have 
been derived using very different approaches. The BLOSUM matrices are empirical, the VTML family are based upon the Markov 
model of amino acid substitution, and the dist . 2 Ocomp matrix is based upon the conditionally independent substitution model. 



alignment and remote homology detection. These three matrices 
have been created using very different evolutionary models; the 
dist . 20comp matrix is based upon our heterogeneous back- 
ground/independent substitution model, and the di st . 2 comp 
background distribution is, in turn, derived from the columns 
of many multiple alignments of homologous protein sequences; 
the popular BLOSUM matrices are empirically derived from 
the BLOCKS database of reliable protein sequence alignments 
(Henikoff & Henikoff, 1992; Henikoff ef a/. , 2000); and the clas- 
sic PA M (toavhoffef adll978h and modern VTML ( iMuUer et al\. 
12002 ^ matrix families are explicitly based upon the Markovian 
model of amino acid replacement. In a recent evaluation of 
pairwise remote homology detection, the VTML160 matrix was 
found to be more effective than any other VTML, PAM or BLO- 
SUM matrix. (Green & Brenner, 2002) However, as can be seen 
in figure |6l the difference in remote homology detection ability 
of the three matrices is relatively small. 



In summary, the important sequence correlations can be ad- 
equately explained by assuming conditionally independent re- 
placements drawn from background distributions that vary from 
site-to-site, but are stable over evolutionary time-scales. The stan- 
dard, Markovian model of amino acid replacement is unneces- 
sary, overly complicated and inconsistent with observed substitu- 
tion patterns. 



This alternative, heterogeneous background, independent sub- 
stitution model may be particularly useful for simultaneous se- 
quence alignment and phylogenetic tree reconstruction, since it 
is necessary to align pairs of close homologs at the leaves, and 
multiply align many remote homologs at the interior nodes of 
the tree. Therefore, a simple (yet realistic) evolutionary dynamic 
that is consistent across a wide range of divergence times, and 
that leads naturally to sequence-sequence, sequence-profile and 
profile-profile alignment algorithms, may be advantageous. 



Mathematical Details 

A collection of homologous residues can be represented by 
a 20 component canonical amino acid count vector, n — 
{ni, . . . , n2o}. The total number of counts can be 1, if the obser- 
vation is taken from a single sequence, or many if the collection 
represents an entire column of a multiple sequence alignment or 
some other related set of residues. 

In general, we wish to estimate whether two collections of ho- 
mologous residues are related, given that detectably homologous 
residues are drawn from the same background amino acid distri- 
bution. The appropriate test statistic is the log odds of sampling 
the two amino acid count vectors from the same, but unobserved, 
background distribution, against the probability of independently 
sampling the two count vectors from different distributions. 

5(n\n^) = log ^^"y (3) 

The probability of independently sampling a particular collec- 
tion of homologous residues n from the background amino acid 
profile from which those residues are drawn, 6 — {d\^ . . . , 6*20}, 
follows the multinomial distribution; 

P{n\B) = M(n\e) 

-,20 rr I 

This is the multivariant generalization of the common binomial 
distribution. 

The probability distribution of background distributions P{Q) 
has been studied and measured by collating columns from many 
multiple protein sequence alignments. Since this is a very large, 
multi-modal probability it is necessary to parameterize the dis- 
tribution into a convenient representation. Typically, a mixture 
of Dirichlet distributions i s used ( Karplus. 1995; Siolander ef a/.L 
ll996l:lDurbin ef aZ.LIl998l) . 

P{B)= PkV{e\a^) (5) 

fc— l,m 
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Figure 6: The substitution matrices, BLOSUM62, VTML160 and 
dist . 2 Ocomp, are of comparable effectiveness under Green 
and Brenner's (2002) evaluation of pairwise remote homology 
detection. A set of about 2800 sequences (none of which 
share more than 40% sequence identity) are collated from the 
SCOP (Structural C l assification of Proteins) dat abase (version 
1.57) dMurzin et all Il995t iBrenner et ali 120001) . SCOP reh- 
ably clusters these sequences into groups of homologs using 
structural information. Each sequence is matched against the 
dataset using Smith-Waterman alignment Ismith & Watermanl 
[Ml'), a particular substitution matrix and appropriate gap penal- 
ties (Green & Brenner, 2002). The results are shown as a plot 
of errors per query against (linearly normalized) coverage, the 
average fraction of true homologs that are found for each se- 
quence. There is a trade off between accuracy and coverage; the 
bottom right of the above graph is ideal; high coverage with few 
errors. We used Bayesian bootstrap resampling to estimate that 
the standard deviation of the coverage is about 0.02 at 0.01 errors 
per query. JCireen & Brennei{l2002tlZachariah et a/.H2004 Price, 
Crooks, Green & Brenner, Unpublished). Thus, there is a statis- 
tically significant, but relatively small and (in practice) unimpor- 
tant variation in homology coverage between the three matrices. 
Note that both BLOSUM and VTML matrices have been directly 
trained upon pairwise alignment data, and may therefore be fa- 
vored in this pairwise alignment test. 



The TO mixture coefficients pk sum to one. The kth Dirichlet dis- 
tribution is itself parameterized by the 20 component (canonical 
amino acids) non-negative vector a'', 



V{e\a) ^ 



1 



20 

z(a) n 



Z{a) 



n.r(a.) 
r(^) ' 



(6) 



where A = 

Dirichlet mixtures are used to model the background proba- 
bility partially because Dirichlet distributions are naturally con- 
jugate to the multinomial distribution (and therefore mathemat- 
ically convenient) and partially because a Dirichlet mixture can 
approximate the true distribution with a reasonably small number 
of parameters. The underlying assumption is that the probability 
distribution is smooth, but lumpy. (See Fig.|2j 

If we do not know the particular background from which the 



observations have been drawn, then we must average over all 
backgrounds to find the probability of observing a particular 
count vector; 

P{n) = I de P{n\9)P{9), 



d0 M{n\0)J2Pk'^{^\c^'')^ 



Pk 



20 



M{n) 



1 1 /■ ^" 



.{iH+a'l-l) 



Z{a^)M{n) 



(7) 



The last line follows because the product in the previous line is 
an unnormalized Dirichlet with parameters (n + a^). There- 
fore, the integral over 6 must be equal to the corresponding 
Dirichlet normalization constant, Z{n + a^). The final result 
is a mixture of multivariate negative hypergeometric distribu- 
tions jjohnson & Kotz, 1969). The negative hypergeometric is 
an un der-appreciated distribution (e.g. Eq. 1 1 .23. iDurbin et al\. 
Il998 >) which bares the same relation to the hypergeometric as the 
negative binomial does to the binomial distribution. The multi- 
variant generalization appears in this case as the combination of 
a Dirichlet and a multinomial. Confusingly, the negative hyper- 
geometric distribution is sometimes called the inverse hypergeo- 
metric, an entirely different distribution, and vice versa. 

The probability of independently sampling two count vectors, 
■n} and n?, from the same undetermined background is 



dO P{n^\e)P{n^\B)P{B), 
dO M{ny)M{n^\e) ^ Pk'D{9\a'') 

k 

v-^ 1 1 1 

l^pk- 



Z{a'') M(ni) M{n^) 



20 



E 



Pk 



d9Y[t 

i=l 

Z{n^ + n'^ + a'') 
Z(afe)M(ni)A'f(n2) 



(8) 



Combing Eqs.0and|8]with the log likelihood ratio, Eq.|3l gen- 
erates a generic profile-profile sequence alignment score that is 
valid whether the number of counts is small or large. 



Z{a'')M{n^)M(n'^) 



Y.Pk 



)_ z,(; 

■Z(a'=)M(ni) ^ Pk Z{a'')M{n'^) 
k k 



(9) 



For the particular case that one of the count vectors con- 
tains only a single observation this score reduces to the standard 
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sequence-profile score frequ ently used by hidden M arkov model 
protein sequence alignment dSiolander et al\ Il996l) . This is in- 
evitable, since the underlying mathematics is the same. 

If both count vectors contain only a single observation, then 
this profile-profile score reduces to a pairwise substitution ma- 
trix. Note, given that 71^ = S^i and ~ Sxj (where S^j is 
a Kronecker delta function), then all but the jth element of the 
product Z{dxj + a'^)/Z(a'') cancels. Thus, 

S,, = log^ 



E 



Pk 




A^{A^ + l) 



(10) 



Applying Eq. [TO] to the 20 component Dirichlet mixture 
dist . 2 Ocomp generates the pairwise substitution matrix illus- 
trated in Fig.|3l 

An interesting feature of this model is that it provides a uni- 
fied homology match score for sequence-sequence, sequence- 
profile and profile-profile alignment (Eq. |9j. As far as we 
are aware, this profile-profile score has not been evaluated in 
a profile-profile alignment algorithm, although it is a natural 
generalization of the established hidden Markov model profile- 
sequence score. However, in the large sample limit Eq.|9] re- 
duces to the Jensen-Shannon divergence between the two em- 
pirical amino acid distributions, a mea sure that has shown 
some promise in profile-profile alignment (I Yona & Levitl l2002t 
lEdear & Siolandeitl2004[fMarti-Renom et a/.Ll2004 . 
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